A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Abstract

Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.

Bibliographic details

Authors
Stehle, Elias; Jacobsen, Hans-Arno;
Year: 2017
Original format: PDF

Similar literature

1. Fast Four-Way Parallel Radix Sorting on GPUs [J]. Linh Ha, Jens Krüger, Claudio T. Silva. Computer Graphics Forum: Journal of the European Association for Computer Graphics. 2010, No. 8
2. High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing [J]. Duane Merrill, Andrew Grimshaw. Parallel Processing Letters. 2011, No. 2
3. High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing [J]. Duane Merrill, Andrew Grimshaw. Parallel Processing Letters. 2011, No. 2
4. Parallelization of bitonic sort and radix sort algorithms on many core GPUs [C]. Yildiz Zehra, Aydin Musa, Yilmaz Guray. International Conference on Electronics, Computer and Computation. 2013
5. Mixed-radix shuffle-based permutation and sorting networks [D]. Mashburn, Brian C. 2005
6. A Flexible Hybrid BCH Decoder for Modern NAND Flash Memories Using General Purpose Graphical Processing Units (GPGPUs) [O]. Arul Subbiah, Tokunbo Ogunfunmi. 2019
7. A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs [O]. Stehle, Elias; Jacobsen, Hans-Arno. 2017
